
Module 04: Logistic Regression
The University of Alabama
2026-03-23
| Study (\(x_1\)) | Sleep (\(x_2\)) | Result |
|---|---|---|
| 5 | 7 | ✅ Pass |
| 3 | 8 | ✅ Pass |
| 1 | 3 | ❌ Fail |
| 2 | 2 | ❌ Fail |

Key Question: How do we draw a line to separate the classes?
We want to find a line (decision boundary) that separates the two classes.
Decision Boundary: \[w_1 x_1 + w_2 x_2 + w_0 = 0\]
In general (more than 2 features): \[w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_0 = 0\] The model is still linear, but the boundary is a hyperplane in \(n\) dimensions.
In practice, data is rarely perfectly separable.
Goal: Find the linear boundary that minimizes classification error.
Key insight: We need a model that outputs a confidence score (probability), not just a hard 0/1 decision.
Let \(z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_0\) be the linear score.
We need a function that maps \(z \in (-\infty, +\infty)\) to a probability in \([0, 1]\).
\[\sigma(z) = \frac{1}{1 + e^{-z}}\]
| Input \(z\) | \(\sigma(z)\) | Interpretation |
|---|---|---|
| \(-\infty\) | \(0\) | Definitely Class 0 |
| \(0\) | \(0.5\) | Uncertain |
| \(+\infty\) | \(1\) | Definitely Class 1 |

Properties of \(\sigma(z)\):

- Smooth and monotonically increasing in \(z\)
- Symmetric: \(\sigma(-z) = 1 - \sigma(z)\)
- Simple and elegant derivative: \[\frac{d\sigma}{dz} = \sigma(z)\big(1 - \sigma(z)\big)\]
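A minimal sketch of the sigmoid and its derivative in plain Python (standard library only):

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    """d(sigma)/dz = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5: the "uncertain" midpoint
print(sigmoid_deriv(0.0))  # 0.25: the derivative peaks at z = 0
```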

Logistic Regression is a linear classifier that outputs a probability.
\[\hat{y} = \sigma(z) = \sigma\!\left(\sum_{i=0}^{n} w_i x_i\right) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}}\]
Prediction rule: if \(\hat{y} > 0.5\) → Class 1, if \(\hat{y} \leq 0.5\) → Class 0.
This is equivalent to: if \(z > 0\) → Class 1, if \(z \leq 0\) → Class 0.
Why “Regression”? Because the internal score \(z\) has the same linear form as in linear regression; logistic regression simply wraps it in a sigmoid to produce probabilities.
Suppose the model has learned: \(\hat{y} = \sigma\!\left(x_{\text{study}} + 2\,x_{\text{sleep}} - 8\right)\)
| Student | Study (\(x_1\)) | Sleep (\(x_2\)) | \(z\) | \(\hat{y} = \sigma(z)\) | Prediction |
|---|---|---|---|---|---|
| 1 | 5 | 7 | \(5+14-8=11\) | \(\approx 1.00\) | ✅ Pass |
| 2 | 3 | 8 | \(3+16-8=11\) | \(\approx 1.00\) | ✅ Pass |
| 3 | 1 | 3 | \(1+6-8=-1\) | \(0.269\) | ❌ Fail |
| 4 | 2 | 2 | \(2+4-8=-2\) | \(0.119\) | ❌ Fail |
The model outputs a confidence: Student 4 has only an 11.9% chance of passing, which is much more informative than a hard 0/1 label!
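The table above can be reproduced with a short script, using the weights \(w_1 = 1\), \(w_2 = 2\), \(w_0 = -8\) from the worked example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# (study hours, sleep hours) for the four students
students = [(5, 7), (3, 8), (1, 3), (2, 2)]
w1, w2, w0 = 1.0, 2.0, -8.0  # learned weights from the example

for i, (x1, x2) in enumerate(students, start=1):
    z = w1 * x1 + w2 * x2 + w0
    y_hat = sigmoid(z)
    label = "Pass" if y_hat > 0.5 else "Fail"
    print(f"Student {i}: z = {z:+.0f}, y_hat = {y_hat:.3f} -> {label}")
```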
| Component | Formula / Description |
|---|---|
| Linear Score | \(z = \mathbf{w}^T \mathbf{x} = w_1x_1 + \cdots + w_nx_n + w_0\) |
| Output (probability) | \(\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}} \in [0,1]\) |
| Decision Boundary | \(z = 0 \;\Leftrightarrow\; \hat{y} = 0.5\) |
| Class 1 | \(\hat{y} > 0.5\) (i.e., \(z > 0\)) |
| Class 0 | \(\hat{y} \leq 0.5\) (i.e., \(z \leq 0\)) |
Since \(\hat{y} \in [0,1]\) is a probability, squared loss \((y - \hat{y})^2\) is technically valid, but composed with the sigmoid it produces a non-convex optimization landscape with flat regions and local minima.
We need a loss designed for probability outputs → Log Loss (Cross-Entropy Loss)
Intuition: the loss should be near zero when the model is confident and correct, and grow without bound when the model is confident and wrong. The logarithm achieves exactly this: for \(y = 1\) we penalize with \(-\log(\hat{y})\); for \(y = 0\), with \(-\log(1-\hat{y})\).

Unified into a single formula: \[\ell(y, \hat{y}) = -y\log(\hat{y}) - (1-y)\log(1-\hat{y})\]
Verify the formula: plug in \(y = 1\) and the second term vanishes, leaving \(-\log(\hat{y})\); plug in \(y = 0\) and the first term vanishes, leaving \(-\log(1-\hat{y})\).
Over the full dataset (\(N\) examples): \[\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[y^{(i)}\log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)})\right]\]
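A minimal implementation of the averaged log loss; the small `eps` clip is a numerical-safety detail not present in the formula:

```python
import math

def log_loss(y_true, y_pred, eps=1e-12):
    """Average cross-entropy over N examples; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip predictions into (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Confident and correct -> tiny loss; confident and wrong -> huge loss
print(log_loss([1, 0], [0.99, 0.01]))  # ~0.01
print(log_loss([1, 0], [0.01, 0.99]))  # ~4.6
```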
We have the chain: \(w_i \xrightarrow{\;z = \mathbf{w}^T\mathbf{x}\;} z \xrightarrow{\;\sigma\;} \hat{y} \xrightarrow{\;\ell\;} \text{loss}\)
\[\frac{d\ell}{dw_i} = \frac{d\ell}{d\hat{y}} \cdot \frac{d\hat{y}}{dz} \cdot \frac{dz}{dw_i}\]
Step 1 — \(\frac{dz}{dw_i}\): \(\quad \frac{dz}{dw_i} = x_i\)
Step 2 — \(\frac{d\hat{y}}{dz}\): \(\quad \frac{d\hat{y}}{dz} = \sigma(z)(1-\sigma(z)) = \hat{y}(1-\hat{y})\)
Step 3 — \(\frac{d\ell}{d\hat{y}}\): \(\quad \frac{d\ell}{d\hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\)
Putting it all together:
\[\frac{d\ell}{dw_i} = \underbrace{\left[\frac{-y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right]}_{\frac{d\ell}{d\hat{y}}} \cdot \underbrace{\hat{y}(1-\hat{y})}_{\frac{d\hat{y}}{dz}} \cdot \underbrace{x_i}_{\frac{dz}{dw_i}}\]
\[= \left[\frac{-y(1-\hat{y}) + (1-y)\hat{y}}{\hat{y}(1-\hat{y})}\right] \cdot \hat{y}(1-\hat{y}) \cdot x_i\]
\[= \left[-y + y\hat{y} + \hat{y} - y\hat{y}\right] \cdot x_i = (\hat{y} - y)\, x_i\]
\[\boxed{\frac{d\ell}{dw_i} = (\hat{y} - y)\, x_i}\]
This is identical in form to the linear regression gradient — the sigmoid derivative cancels out beautifully!
Gradient: \[\frac{d\ell}{dw_i} = (\hat{y} - y)\,x_i \qquad \frac{d\ell}{dw_0} = (\hat{y} - y)\]
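The boxed gradient can be checked numerically against a finite-difference estimate; the example point and weights below are arbitrary choices for the sanity check, not part of the derivation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, w0, x, y):
    """Log loss for a single example."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w0)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

# One arbitrary example: x = (2, 3), y = 1, arbitrary weights
w, w0, x, y = [0.5, -0.3], 0.1, [2.0, 3.0], 1
y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w0)

# Analytic gradient from the derivation: (y_hat - y) * x_i
analytic = [(y_hat - y) * xi for xi in x]

# Central finite-difference estimate of d(loss)/dw_i
h = 1e-6
numeric = []
for i in range(len(w)):
    w_plus = list(w); w_plus[i] += h
    w_minus = list(w); w_minus[i] -= h
    numeric.append((loss(w_plus, w0, x, y) - loss(w_minus, w0, x, y)) / (2 * h))

print(analytic, numeric)  # the two gradients should agree closely
```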
Update rule (gradient descent): \[w_i \leftarrow w_i - \eta\,(\hat{y} - y)\,x_i \qquad w_0 \leftarrow w_0 - \eta\,(\hat{y} - y)\]
| Step | Action |
|---|---|
| 1 | Initialize \(w_1, \ldots, w_n, w_0\) randomly (or to zero) |
| 2 | Pick a random data point \((x_1^{(i)}, \ldots, x_n^{(i)}, y^{(i)})\) |
| 3 | Compute \(z = \mathbf{w}^T\mathbf{x}^{(i)}\) |
| 4 | Compute prediction \(\hat{y}^{(i)} = \sigma(z)\) |
| 5 | Update: \(w_j \leftarrow w_j - \eta(\hat{y}^{(i)} - y^{(i)})\,x_j^{(i)}\) for all \(j\) |
| 6 | Repeat from Step 2 until convergence |
Stopping rule: when the loss change per iteration falls below a threshold \(\epsilon\).
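The six steps above, run on the study/sleep dataset from the running example, might look like this sketch; the learning rate, iteration count, and random seed are illustrative choices (a fixed iteration budget stands in for the loss-change stopping rule):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Study/sleep dataset from the running example: (x1, x2, y)
data = [(5, 7, 1), (3, 8, 1), (1, 3, 0), (2, 2, 0)]

random.seed(0)
w1 = w2 = w0 = 0.0   # Step 1: initialize weights to zero
eta = 0.1            # learning rate (illustrative choice)

for _ in range(5000):
    x1, x2, y = random.choice(data)   # Step 2: pick a random data point
    z = w1 * x1 + w2 * x2 + w0        # Step 3: linear score
    y_hat = sigmoid(z)                # Step 4: prediction
    w1 -= eta * (y_hat - y) * x1      # Step 5: gradient updates
    w2 -= eta * (y_hat - y) * x2
    w0 -= eta * (y_hat - y)

# All four training points should now be classified correctly
for x1, x2, y in data:
    y_hat = sigmoid(w1 * x1 + w2 * x2 + w0)
    print(x1, x2, y, round(y_hat, 3))
```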
Logistic regression is inherently binary (two classes). For \(K\) classes, we use the One-vs-All (OvA) strategy.
Procedure: train \(K\) binary classifiers \(f_1, \ldots, f_K\), where classifier \(f_k\) is trained on a relabeled dataset with class \(k\) as 1 (positive) and all other classes as 0 (negative).
For \(K = 3\) classes: train 3 classifiers and pick the most confident one.

For \(K\) classes: \(\hat{y} = \underset{k \in \{1, \ldots, K\}}{\arg\max}\; f_k(x)\)

Each classifier learns to separate one class from the rest. Final answer: class with highest \(\hat{y}_k\).



